1 Introduction

Gross Primary Production (GPP) represents the total carbon uptake by vegetation and is influenced by multiple environmental variables. Using high-frequency half-hourly flux data, this study aims to identify the key predictors of GPP. A stepwise forward regression algorithm is implemented from scratch in R to build a predictive model and select the most relevant variables.


2 Reasearch question

Which environmental variables best predict GPP in half-hourly flux data, as determined by a stepwise forward regression approach?


3 Available data

The dataset, obtained from this source, contains daily aggregated measurements from four flux tower sites in Switzerland (see Table 1). The target variable for our modeling is gross primary production (GPP_NT_VUT_REF), while the other variables in the original dataset are summarized in Table 2.

For modeling GPP, we selected a subset of predictor variables based on data quality. Variables ending with _F were prioritized because they generally exhibit higher data completeness. In contrast, most variables ending with _F_MDS—which are gap-filled using the Marginal Distribution Sampling (MDS) method—were excluded due to substantial missing values. Nevertheless, key predictors such as CO2_F_MDS, PPFD_IN, and USTAR were retained despite some missing data.

Table 1: Site Information of the Raw Data
Site Start date End date
CH-Dav 1997-01-01 2014-12-31
CH-Lae 2004-01-01 2014-12-31
FI-Hyy 1996-01-01 2014-12-31
FR-Pue 2000-01-01 2014-12-31
Table 2: Variable Descriptions
id variable units description
1 TA_F °C Air temperature, consolidated from TA_F_MDS and TA_ERA
2 SW_IN_F W/m² Shortwave radiation, incoming, consolidated from SW_IN_F_MDS and SW_IN_ERA (negative values set to zero)
3 LW_IN_F W/m² Longwave radiation, incoming, consolidated from LW_IN_F_MDS and LW_IN_ERA
4 VPD_F hPa Vapor Pressure Deficit, consolidated from VPD_F_MDS and VPD_ERA
5 PA_F hPa Atmospheric pressure, consolidated from PA and PA_ERA
6 P_F mm Precipitation, consolidated from P and P_ERA
7 WS_F m/s Wind speed, consolidated from WS and WS_ERA
8 TA_F_MDS °C Air temperature, gap-filled using MDS (Marginal Distribution Sampling)
9 SW_IN_F_MDS W/m² Shortwave radiation, incoming, gap-filled using MDS (negative values set to zero)
10 LW_IN_F_MDS W/m² Longwave radiation, incoming, gap-filled using MDS
11 VPD_F_MDS hPa Vapor Pressure Deficit, gap-filled using MDS
12 CO2_F_MDS ppm COâ‚‚ mole fraction, gap-filled with MDS
13 PPFD_IN µmol/m²/s Photosynthetic photon flux density, incoming
14 USTAR m/s Friction velocity, indicates atmospheric turbulence
15 GPP_NT_VUT_REF µmol CO₂/m²/s Gross Primary Production, from Nighttime partitioning method

4 Implementation

4.1 Bivariate Model Evaluation

In this section, we will implement the evaluation of all possible bivariate models, where each model includes a single predictor to predict the target variable GPP. The goal is to identify the best-fitting model based on the R² and AIC values.

4.1.1 Steps:

  1. Set One Predictor (p = 1): For each variable, we will fit a simple linear regression model.
  2. Fit All Regression Models: Fit a linear regression model for each predictor and compute the R² and AIC value.
  3. Select Best Model: Identify the model with the highest R² and compute its AIC value.
  4. Output: Display the best-fitting model based on R² and its corresponding AIC value.

4.1.2 Main Implementation Function

process_models <- function(df, variables, variable_metadata, target = "GPP_NT_VUT_REF") {
  # Initialize storage objects
  models <- list()      # Stores fitted linear models
  var_names <- list()   # Stores variable names for display
  formulas <- list()    # Stores model formulas
  AICs <- numeric()     # Stores AIC values
  R2s <- numeric()      # Stores R-squared values
  plots <- list()       # Stores diagnostic plots
  best_model <- NULL    # Will track the best performing model
  
  # Process each predictor variable
  for (var in variables) {
    # Prepare data: select target and predictor, remove missing values
    df_model <- df |>
      dplyr::select(all_of(c(target, var))) |>
      na.omit()
    
    # Create and fit linear model
    formula <- as.formula(paste(target, "~", var))
    model <- lm(formula, data = df_model)
    
    # Store model components
    formulas[[var]] <- formula
    models[[var]] <- model
    AICs[[var]] <- AIC(model)
    R2s[[var]] <- summary(model)$r.squared
    
    # Extract metadata for labeling
    var_index <- which(variable_metadata$variable == var)
    var_name <- sub(",.*", "", variable_metadata$description[var_index])
    var_expr <- paste0(var_name, " (", variable_metadata$units[var_index], ")")
    
    # Generate diagnostic plot
    plots[[var]] <- plot_bi(df_model, var, var_expr)
    var_names[[var]] <- var_name
    
    # Print progress information
    message("Processed: ", var,
            " | R² = ", round(R2s[[var]], 3),
            " | AIC = ", round(AICs[[var]], 1))
    
    # Update best model if current model performs better
    if (is.null(best_model) || R2s[[var]] > R2s[[best_model]]) {
      best_model <- var
    }
  }
  
  # Compile best model information
  best_model_info <- list(
    model = best_model,
    R2 = R2s[[best_model]],
    AIC = AICs[[best_model]]
  )
  
  # Return all results
  return(list(
    formulas = formulas,
    models = models,
    AICs = AICs,
    R2s = R2s,
    plots = plots,
    var_names = var_names,
    best_model_info = best_model_info
  ))
}
  • df_clean: This is the dataset that contains the target variable (GPP_NT_VUT_REF) and the predictors.
  • predictors: The list of predictor variable names to evaluate in the bivariate models.
  • lm(formula, data = df_model): This fits a linear regression model to predict the target variable based on the selected predictor.
  • R² and AIC: These metrics are calculated for each model. R² measures the proportion of variance in the target variable explained by the model, and AIC helps identify the model that balances fit and complexity.
  • Best Model Selection: The function identifies the model with the highest R² value and prints out the corresponding AIC for the best-fitting model.

5 Results and Discussion

As shown in Figure 1, model performance varies across predictors, with PPFD_IN showing the best combination of low AIC and high R². PPFD_IN represents the amount of photosynthetically active radiation (PAR) received at the canopy level, which is a critical driver of photosynthesis in plants. The relatively high R² value of 0.45 indicates that nearly 45% of the variation in GPP can be explained by changes in incoming light alone. This result is consistent with ecological understanding, as light availability is one of the primary environmental factors controlling carbon assimilation in terrestrial ecosystems. The AIC value of 77577.9 suggests that, among all tested variables, the model using PPFD_IN balances explanatory power and model complexity most effectively.

Comparison of model performance using AIC and R² metrics

Figure 1: Comparison of model performance using AIC and R² metrics

Analysis of the individual models presented in Figures 2 and 1 reveals several key insights for GPP prediction modeling. Incoming shortwave radiation emerges as a consistently strong predictor, showing comparable importance across different model configurations. This predictive power likely stems from its high correlation with PPFD, as both variables capture similar light availability information for photosynthesis. Given this substantial collinearity, modelers should implement a strict either-or selection between these two radiation variables to maintain model robustness and avoid inflation of variance estimates.

Beyond radiation variables, air temperature demonstrates particularly strong explanatory power for GPP variation, especially in temperate ecosystems where thermal constraints significantly influence photosynthetic activity. The nearly universal significance of temperature across model permutations suggests it should be prioritized in subsequent model development. Secondary environmental controls including vapor pressure deficit and incoming longwave radiation show more context-dependent importance, with their inclusion value potentially varying by biome type or temporal scale of analysis.

Diagnostic plots for all single-predictor models

Figure 2: Diagnostic plots for all single-predictor models